Context
Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?
Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?
Overview
The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.
Data
This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks respondents whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under $20), coffee houses, carry out & take away, bar, and more expensive restaurants ($20 - $50).
Deliverables
Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.
The attributes of this data set include:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
Use the prompts below to get started with your data analysis.
Read in the coupons.csv file.
data = pd.read_csv('data/coupons.csv')
data.shape
(12684, 26)
data.head()
|  | destination | passanger | weather | temperature | time | coupon | expiration | gender | age | maritalStatus | ... | CoffeeHouse | CarryAway | RestaurantLessThan20 | Restaurant20To50 | toCoupon_GEQ5min | toCoupon_GEQ15min | toCoupon_GEQ25min | direction_same | direction_opp | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No Urgent Place | Alone | Sunny | 55 | 2PM | Restaurant(<20) | 1d | Female | 21 | Unmarried partner | ... | never | NaN | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Coffee House | 2h | Female | 21 | Unmarried partner | ... | never | NaN | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Carry out & Take away | 2h | Female | 21 | Unmarried partner | ... | never | NaN | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 1 |
| 3 | No Urgent Place | Friend(s) | Sunny | 80 | 2PM | Coffee House | 2h | Female | 21 | Unmarried partner | ... | never | NaN | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | No Urgent Place | Friend(s) | Sunny | 80 | 2PM | Coffee House | 1d | Female | 21 | Unmarried partner | ... | never | NaN | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 0 |
5 rows × 26 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null  object
 16  CoffeeHouse           12467 non-null  object
 17  CarryAway             12533 non-null  object
 18  RestaurantLessThan20  12554 non-null  object
 19  Restaurant20To50      12495 non-null  object
 20  toCoupon_GEQ5min      12684 non-null  int64
 21  toCoupon_GEQ15min     12684 non-null  int64
 22  toCoupon_GEQ25min     12684 non-null  int64
 23  direction_same        12684 non-null  int64
 24  direction_opp         12684 non-null  int64
 25  Y                     12684 non-null  int64
dtypes: int64(8), object(18)
memory usage: 2.5+ MB
data.isnull().sum()
destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64
data.isnull().sum().sum()
13370
data = data.drop(['car'], axis=1)
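# mean() applies only to numeric columns, none of which have missing values, so the categorical NaNs remain
data = data.fillna(data.mean(numeric_only=True))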
data.isnull().sum()
destination               0
passanger                 0
weather                   0
temperature               0
time                      0
coupon                    0
expiration                0
gender                    0
age                       0
maritalStatus             0
has_children              0
education                 0
occupation                0
income                    0
Bar                     107
CoffeeHouse             217
CarryAway               151
RestaurantLessThan20    130
Restaurant20To50        189
toCoupon_GEQ5min          0
toCoupon_GEQ15min         0
toCoupon_GEQ25min         0
direction_same            0
direction_opp             0
Y                         0
dtype: int64
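The counts above show that the remaining missing values all sit in the categorical visit-frequency columns. One possible way to handle them, sketched here on a small made-up frame, is to fill each column with its most frequent value. The helper name `fill_with_mode` and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative only)
toy = pd.DataFrame({
    "Bar": ["never", None, "1~3", "never"],
    "CoffeeHouse": ["less1", "less1", None, "4~8"],
    "Y": [1, 0, 1, 0],
})


def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's mode."""
    out = df.copy()
    for col in columns:
        # mode() ignores NaN by default; take the first value if there are ties
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


filled = fill_with_mode(toy, ["Bar", "CoffeeHouse"])
print(filled.isnull().sum())
```

Filling with the mode keeps every row but nudges the distribution toward the most common answer, so dropping the affected rows is a reasonable alternative when they are few.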
data.coupon.value_counts(normalize=True)
Coffee House             0.315043
Restaurant(<20)          0.219647
Carry out & Take away    0.188663
Bar                      0.159019
Restaurant(20-50)        0.117629
Name: coupon, dtype: float64
Visualize the coupon column.
# plot a histogram using Plotly
fig = px.histogram(data, x = "coupon", text_auto = True)
fig
# plot a histogram using Plotly
fig = px.histogram(data, x = "temperature", text_auto = True)
fig
data.columns
Index(['destination', 'passanger', 'weather', 'temperature', 'time', 'coupon',
'expiration', 'gender', 'age', 'maritalStatus', 'has_children',
'education', 'occupation', 'income', 'Bar', 'CoffeeHouse', 'CarryAway',
'RestaurantLessThan20', 'Restaurant20To50', 'toCoupon_GEQ5min',
'toCoupon_GEQ15min', 'toCoupon_GEQ25min', 'direction_same',
'direction_opp', 'Y'],
dtype='object')
Investigating the Bar Coupons
Now, we will lead you through an exploration of just the bar related coupons.
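Create a new DataFrame that contains just the bar coupons.
# Note: this selects the Bar visit-frequency column for every row; it does not filter to rows where coupon == 'Bar'
bar = data[['Bar']]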
bar.head()
|  | Bar |
|---|---|
| 0 | never |
| 1 | never |
| 2 | never |
| 3 | never |
| 4 | never |
bar.value_counts(normalize=True)
Bar
never    0.413215
less1    0.276855
1~3      0.196629
4~8      0.085553
gt8      0.027749
dtype: float64
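The selection above keeps the `Bar` visit-frequency column for all rows. If the intent is to look only at rows where the offered coupon was a bar coupon, a boolean filter on the `coupon` column is one way to do it. The sketch below uses a made-up frame with the same column names; the variable names `bar_coupons` and `bar_acceptance` are hypothetical.

```python
import pandas as pd

# Toy frame with the same column names as the survey data (values are illustrative)
toy = pd.DataFrame({
    "coupon": ["Bar", "Coffee House", "Bar", "Bar", "Restaurant(<20)"],
    "Bar": ["never", "1~3", "4~8", "less1", "never"],
    "Y": [0, 1, 1, 0, 1],
})

# Keep only rows where a bar coupon was offered; .copy() makes the result an
# independent frame, so adding columns later does not trigger SettingWithCopyWarning
bar_coupons = toy[toy["coupon"] == "Bar"].copy()

# Y is 0/1, so its mean is the proportion of accepted bar coupons
bar_acceptance = bar_coupons["Y"].mean()
print(len(bar_coupons), round(bar_acceptance, 3))
```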
threeLess = ['less1','1~3']
threeMore = ['4~8','gt8']
empty = []
# Label each row by bar-visit frequency; NaN values fall through to 'never'
for i,k in bar.iterrows():
    if k['Bar'] in threeLess:
        empty.append('3 or fewer')
    elif k['Bar'] in threeMore:
        empty.append('4 or more')
    else:
        empty.append('never')
bar['compare'] = empty
<ipython-input-18-426d9e03480b>:16: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# plot a histogram using Plotly
fig = px.histogram(data, x = "Bar", color='Y', text_auto = True)
fig
data.Bar = data.Bar.fillna('never')
morethanonce = ['1~3','4~8','gt8']
lessthanonce = ['less1','never']
empty = []
for i,k in data.iterrows():
    if k['Bar'] in morethanonce:
        empty.append('morethanonce')
    elif k['Bar'] in lessthanonce:
        empty.append('lessthanonce')
    else:
        empty.append('nothing')
data['bar2'] = empty
data.bar2.value_counts()
lessthanonce    8786
morethanonce    3898
Name: bar2, dtype: int64
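The row-by-row loop above can also be written as a vectorized expression, which is usually shorter and faster in pandas. The following is a minimal sketch on made-up data, assuming the `Bar` column has no missing values at this point; it also shows how a 0/1 target can be averaged per group to get an acceptance rate.

```python
import numpy as np
import pandas as pd

# Toy frame (values are illustrative, not taken from the survey)
toy = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "4~8", "gt8", "never"],
    "Y": [0, 0, 1, 1, 0, 1],
})

more_than_once = ["1~3", "4~8", "gt8"]

# isin() builds a boolean mask; np.where picks a label for each row from it
toy["bar2"] = np.where(toy["Bar"].isin(more_than_once), "morethanonce", "lessthanonce")

# Mean of the 0/1 target per group is the acceptance rate of that group
rates = toy.groupby("bar2")["Y"].mean()
print(rates)
```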
older25 = ['25+','26','31','50plus','36','41','46']
young25 = ['below21','21']
empty = []
for i,k in data.iterrows():
    if k['age'] in older25:
        empty.append('over 25')
    elif k['age'] in young25:
        empty.append('less than 25')
    else:
        empty.append('nothing')
data['age2'] = empty
data['age2'].value_counts()
over 25         9484
less than 25    3200
Name: age2, dtype: int64
data.head()
|  | destination | passanger | weather | temperature | time | coupon | expiration | gender | age | maritalStatus | ... | RestaurantLessThan20 | Restaurant20To50 | toCoupon_GEQ5min | toCoupon_GEQ15min | toCoupon_GEQ25min | direction_same | direction_opp | Y | bar2 | age2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No Urgent Place | Alone | Sunny | 55 | 2PM | Restaurant(<20) | 1d | Female | 21 | Unmarried partner | ... | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 1 | lessthanonce | less than 25 |
| 1 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Coffee House | 2h | Female | 21 | Unmarried partner | ... | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 0 | lessthanonce | less than 25 |
| 2 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Carry out & Take away | 2h | Female | 21 | Unmarried partner | ... | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 1 | lessthanonce | less than 25 |
| 3 | No Urgent Place | Friend(s) | Sunny | 80 | 2PM | Coffee House | 2h | Female | 21 | Unmarried partner | ... | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 0 | lessthanonce | less than 25 |
| 4 | No Urgent Place | Friend(s) | Sunny | 80 | 2PM | Coffee House | 1d | Female | 21 | Unmarried partner | ... | 4~8 | 1~3 | 1 | 1 | 0 | 0 | 1 | 0 | lessthanonce | less than 25 |
5 rows × 27 columns
morethanonce = ['1~3','4~8','gt8']
lessthanonce = ['less1','never']
empty = []
# Note: rows matching neither condition are labelled 'nothing', so 'all_others'
# here covers only infrequent bar-goers under 25, not every remaining driver
for i,k in data.iterrows():
    if k['bar2'] == 'morethanonce' and k['age2'] == 'over 25':
        empty.append('bar 1+/month and 25+')
    elif k['bar2'] == 'lessthanonce' and k['age2'] == 'less than 25':
        empty.append('all_others')
    else:
        empty.append('nothing')
data['bar_and_age'] = empty
data.bar_and_age.value_counts()
nothing                 7828
bar 1+/month and 25+    2777
all_others              2079
Name: bar_and_age, dtype: int64
# plot a histogram using Plotly
fig = px.histogram(data, x='bar_and_age', color='Y', text_auto=True)
fig
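The grouping above produces three labels, which makes a direct "target group versus everyone else" comparison harder to read. A possible alternative, sketched on made-up data, is to build a single boolean mask for the target group and treat every other row as the comparison group. The column name `group` and its labels are illustrative assumptions.

```python
import pandas as pd

# Toy frame reusing the derived column names from above (values are illustrative)
toy = pd.DataFrame({
    "bar2": ["morethanonce", "morethanonce", "lessthanonce", "morethanonce", "lessthanonce"],
    "age2": ["over 25", "less than 25", "over 25", "over 25", "less than 25"],
    "Y": [1, 0, 0, 1, 1],
})

# True only for drivers who go to a bar more than once a month AND are over 25
target = (toy["bar2"] == "morethanonce") & (toy["age2"] == "over 25")

# Every row that is not in the target group becomes "all others"
toy["group"] = target.map({True: "bar 1+/month and 25+", False: "all others"})

# Acceptance rate and group size side by side
summary = toy.groupby("group")["Y"].agg(["mean", "size"])
print(summary)
```

Reporting the group size next to the rate helps judge how much weight a difference between the two groups deserves.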
# Recompute bar2 with the same grouping as before
morethanonce = ['1~3','4~8','gt8']
lessthanonce = ['less1','never']
empty = []
for i,k in data.iterrows():
    if k['Bar'] in morethanonce:
        empty.append('morethanonce')
    elif k['Bar'] in lessthanonce:
        empty.append('lessthanonce')
    else:
        empty.append('nothing')
data['bar2'] = empty
data.passanger.value_counts()
Alone        7305
Friend(s)    3298
Partner      1075
Kid(s)       1006
Name: passanger, dtype: int64
data.occupation.value_counts()
Unemployed                                   1870
Student                                      1584
Computer & Mathematical                      1408
Sales & Related                              1093
Education&Training&Library                    943
Management                                    838
Office & Administrative Support               639
Arts Design Entertainment Sports & Media      629
Business & Financial                          544
Retired                                       495
Food Preparation & Serving Related            298
Healthcare Practitioners & Technical          244
Healthcare Support                            242
Community & Social Services                   241
Legal                                         219
Transportation & Material Moving              218
Protective Service                            175
Personal Care & Service                       175
Architecture & Engineering                    175
Life Physical Social Science                  170
Construction & Extraction                     154
Installation Maintenance & Repair             133
Production Occupations                        110
Building & Grounds Cleaning & Maintenance      44
Farming Fishing & Forestry                     43
Name: occupation, dtype: int64
# occupations other than farming, fishing, or forestry
# Note: the data labels this category 'Farming Fishing & Forestry' (see the counts above),
# so the lowercase strings below match no rows and the occupation check excludes nobody
occupation = ['farming','fishing','forestry']
morethanonce = ['1~3','4~8','gt8']
lessthanonce = ['less1','never']
empty = []
for i,k in data.iterrows():
    if k['bar2'] == 'morethanonce' and k['passanger'] != 'Kid(s)' and k['occupation'] not in occupation:
        empty.append('bar 1+/month-noKids')
    # elif k['bar2'] == 'lessthanonce' and k['age2'] == 'less than 25':
    #     empty.append('all_others')
    else:
        empty.append('all others')
data['bar_age_occupation'] = empty
data.bar_age_occupation.value_counts()
all others             8988
bar 1+/month-noKids    3696
Name: bar_age_occupation, dtype: int64
# plot a histogram using Plotly
fig = px.histogram(data, x='bar_age_occupation', color='Y', text_auto=True)
fig
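The occupation counts list the farming category under the single label 'Farming Fishing & Forestry', so an exclusion list has to use that exact string to have any effect. Here is a minimal sketch of the three-part condition as a boolean mask, on made-up rows; the variable names `excluded` and `mask` are hypothetical.

```python
import pandas as pd

# Toy frame with the survey's column names (values are illustrative)
toy = pd.DataFrame({
    "bar2": ["morethanonce", "morethanonce", "morethanonce", "lessthanonce"],
    "passanger": ["Alone", "Kid(s)", "Friend(s)", "Alone"],
    "occupation": ["Student", "Legal", "Farming Fishing & Forestry", "Retired"],
    "Y": [1, 0, 1, 0],
})

# The exclusion list must match the category label exactly as it appears in the data
excluded = ["Farming Fishing & Forestry"]

mask = (
    (toy["bar2"] == "morethanonce")          # goes to a bar more than once a month
    & (toy["passanger"] != "Kid(s)")         # passenger is not a kid
    & ~toy["occupation"].isin(excluded)      # occupation is outside the excluded category
)
print(mask.tolist())
```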
go to cheap restaurants more than 4 times a month and income is less than 50K
RestaurantLessThan20
Restaurant20To50
data.RestaurantLessThan20.value_counts()
1~3      5376
4~8      3580
less1    2093
gt8      1285
never     220
Name: RestaurantLessThan20, dtype: int64
data.income.value_counts()
$25000 - $37499     2013
$12500 - $24999     1831
$37500 - $49999     1805
$100000 or More     1736
$50000 - $62499     1659
Less than $12500    1042
$87500 - $99999      895
$75000 - $87499      857
$62500 - $74999      846
Name: income, dtype: int64
morethanfour = ['4~8','gt8']
# All brackets below 50K, including the lowest one
incomes = ['Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999']
empty = []
for i,k in data.iterrows():
    if k['RestaurantLessThan20'] in morethanfour and k['income'] in incomes:
        empty.append('moreThanFour_lessThan50K')
    # elif k['Bar'] in lessthanonce:
    #     empty.append('lessthanonce')
    else:
        empty.append('all others')
data['restaurants_income'] = empty
fig = px.histogram(data, x='restaurants_income', color='Y', text_auto=True)
fig
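Picking the income brackets below 50K by hand makes it easy to miss one, such as 'Less than $12500'. One option is to derive the list from the labels themselves. The sketch below parses the dollar amounts out of each bracket label shown in the income counts above; the helper `income_upper_bound` is a hypothetical name introduced for illustration.

```python
import math
import re

# Bracket labels as they appear in the income value counts
brackets = [
    "$25000 - $37499", "$12500 - $24999", "$37500 - $49999",
    "$100000 or More", "$50000 - $62499", "Less than $12500",
    "$87500 - $99999", "$75000 - $87499", "$62500 - $74999",
]


def income_upper_bound(label):
    """Return the upper bound in dollars of an income bracket label."""
    if label.endswith("or More"):
        return math.inf  # open-ended top bracket
    # Collect every dollar amount in the label; the last one is the upper bound
    amounts = [int(a) for a in re.findall(r"\$(\d+)", label)]
    return amounts[-1]


below_50k = [b for b in brackets if income_upper_bound(b) < 50000]
print(below_50k)
```

The resulting list could then be passed to `isin()` or used in place of a hand-written bracket list.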
Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  Bar                   12684 non-null  object
 15  CoffeeHouse           12467 non-null  object
 16  CarryAway             12533 non-null  object
 17  RestaurantLessThan20  12554 non-null  object
 18  Restaurant20To50      12495 non-null  object
 19  toCoupon_GEQ5min      12684 non-null  int64
 20  toCoupon_GEQ15min     12684 non-null  int64
 21  toCoupon_GEQ25min     12684 non-null  int64
 22  direction_same        12684 non-null  int64
 23  direction_opp         12684 non-null  int64
 24  Y                     12684 non-null  int64
 25  bar2                  12684 non-null  object
 26  age2                  12684 non-null  object
 27  bar_and_age           12684 non-null  object
 28  bar_age_occupation    12684 non-null  object
 29  restaurants_income    12684 non-null  object
dtypes: int64(8), object(22)
memory usage: 2.9+ MB
data.education.value_counts()
Some college - no degree                  4351
Bachelors degree                          4335
Graduate degree (Masters or Doctorate)    1852
Associates degree                         1153
High School Graduate                       905
Some High School                            88
Name: education, dtype: int64
fig = px.histogram(data, x='education', color='Y', text_auto=True)
fig
fig = px.histogram(data, x='weather', color='Y', text_auto=True)
fig
data.gender.value_counts(normalize=True)
Female    0.513324
Male      0.486676
Name: gender, dtype: float64
fig = px.histogram(data, x='time', color='Y', text_auto=True)
fig
# plot a histogram using Plotly
fig = px.histogram(data, x = "coupon", color='Y', text_auto = True)
fig
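The stacked histograms show raw counts, which makes it hard to compare coupon types of different sizes. A numeric summary can complement them. The following sketch, on made-up rows with the survey's `coupon` and `Y` column names, computes the acceptance rate per coupon type; the name `rate_by_coupon` is illustrative.

```python
import pandas as pd

# Toy frame (values are illustrative, not taken from the survey)
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Coffee House", "Coffee House", "Carry out & Take away"],
    "Y": [0, 1, 1, 1, 1],
})

# Y is 0/1, so the per-group mean is the share of accepted coupons of that type
rate_by_coupon = (
    toy.groupby("coupon")["Y"]
    .mean()
    .sort_values(ascending=False)
)
print(rate_by_coupon)
```

The same pattern applies to any categorical column, for example `weather`, `time`, or `education`.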